May 14th 2020

Introduction to the study

  • Raw data
## # A tibble: 242 x 100
##   Snake Reference Note  `SVMP (Snake Ve… `PI-SVMP (Snake… `PII-SVMP (Snak…
##   <chr> <chr>     <chr>            <dbl>            <dbl>            <dbl>
## 1 Agki… https://… Mexi…             24.5                0                0
## 2 Agki… https://… Cost…             30.8                0                0
## 3 Agki… https://… Mexi…             30.6                0                0
## 4 Agki… https://… Orig…             32.5                0                0
## # … with 238 more rows, and 94 more variables: `PIII-SVMP (Snake Venom
## #   Metalloproteinase PIII), %` <dbl>, …
  • Newly found data
## # A tibble: 27 x 4
##   Toxin               `Vipera aspis asp… `Vipera berus ber… `Vipera anatolica s…
##   <chr>                            <dbl>              <dbl>                <dbl>
## 1 SVMP (Snake Venom …               13.4                 NA                 42.9
## 2 3Ftx (three-finger…               NA                   NA                 NA  
## 3 Unknown peptides                  NA                   NA                 23.5
## 4 PLA2 (Phospholipas…               30.9                 NA                  8.2
## # … with 23 more rows

Goal of study

  • Develop a tool for venom composition analysis
  • Group snakes by family based on venom composition (PCA, K-means, ANN)

Project outline

  • Loading and cleaning data
    • Map locations to country
  • Augmentation of data
    • Merge datasets
    • Create genus and species columns
    • Group toxins
  • Analysis and visualisations
    • Geographical and genus distribution
    • Venom composition analysis
  • Unsupervised analysis
    • PCA
    • K-means clustering
  • Supervised classification model
    • Artificial Neural Network (ANN)

Materials and methods

  • Data processing, modelling, and the creation of this presentation were performed in RStudio Cloud.

  • Coding followed the tidyverse style guide by Hadley Wickham.

  • The Artificial Neural Network modelling was performed in a separate project.

  • The whole project is available on GitHub at: https://github.com/rforbiodatascience/2020_group04

Used packages: httr, readxl, tidyverse, knitr, plotly, maps, patchwork, shiny, rsconnect, keras, devtools

Tidying and transforming data

  • Tidy raw data
    • Load and clean data
  • Transform data
    • Join new data
    • Group toxins
    • Remove toxins found in fewer than five snakes
    • Map genus to snake family
## # A tibble: 233 x 37
##   Snake Genus Species Family Country Reference SVMPi `DC-fragment` CRISP   PLB
##   <chr> <chr> <chr>   <chr>  <chr>   <chr>     <dbl>         <dbl> <dbl> <dbl>
## 1 Agki… Agki… biline… Viper… Mexico  https://…     0           0    0        0
## 2 Agki… Agki… biline… Viper… Costa … https://…     0           0    0        0
## 3 Agki… Agki… biline… Viper… Mexico  https://…     0           0    5.6      0
## 4 Agki… Agki… contor… Viper… Unknown https://…     0           0.1  3.7      0
## 5 Agki… Agki… contor… Viper… USA     https://…     0           0    1.96     0
## 6 Agki… Agki… contor… Viper… USA     https://…     0           0    0        0
## 7 Agki… Agki… contor… Viper… USA     https://…     0           0    1.9      0
## # … with 226 more rows, and 27 more variables: Crotoxin <dbl>, …
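
Two of the transformation steps listed above, creating the genus and species columns and removing rarely observed toxins, can be sketched roughly as follows. This is a minimal base R illustration on invented toy data, not the project's own code. The toy abundances, the helper names (`venom`, `toxin_cols`, `min_snakes`), and the reading of "found in a snake" as an abundance above zero are all assumptions.

```r
# Toy data: invented toxin percentages, NOT the study data.
venom <- data.frame(
  Snake = c("Agkistrodon bilineatus", "Agkistrodon contortrix",
            "Naja kaouthia", "Bungarus candidus",
            "Daboia russelii", "Vipera berus"),
  SVMP  = c(24.5, 30.8, 2.1, 0, 20.3, 13.4),
  PLA2  = c(30.9, 25.0, 12.2, 40.1, 35.0, 8.2),
  PLB   = c(0, 0, 0, 0, 0.5, 0),
  stringsAsFactors = FALSE
)

# Split the binomial name into genus (first word) and species (second word).
name_parts    <- strsplit(venom$Snake, " ", fixed = TRUE)
venom$Genus   <- vapply(name_parts, function(p) p[1], character(1))
venom$Species <- vapply(name_parts, function(p) p[2], character(1))

# Assumption: a toxin counts as "found" in a snake when its abundance is > 0.
# Keep only toxins found in at least `min_snakes` snakes.
min_snakes <- 5
toxin_cols <- c("SVMP", "PLA2", "PLB")
n_detected <- colSums(venom[toxin_cols] > 0, na.rm = TRUE)
keep       <- names(n_detected)[n_detected >= min_snakes]

venom_filtered <- venom[c("Snake", "Genus", "Species", keep)]
print(venom_filtered)
```

In this toy table, `PLB` is detected in a single snake and is therefore dropped, while `SVMP` and `PLA2` are kept.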

Analysis and visualisations

Geographical overview of samples

Snakes from richer countries, or from countries with a focus on snake research, are overrepresented.

Genus distribution according to family

Venom composition in snake families

Toxin abundances

Comparing venom composition between snake species

Comparing venom composition within species

Shiny app

Unsupervised and supervised learning

Results from PCA and K-means
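
As a rough illustration of how PCA and K-means can be combined on composition data, the sketch below uses only base R (`prcomp` and `kmeans` from the stats package) on simulated abundances. The two simulated groups, the choice of two clusters, and running K-means on the first two principal components are assumptions made for this example. They are not necessarily the settings used in the project.

```r
set.seed(1)  # reproducible toy data and k-means starts

# Simulated composition matrix: two hypothetical groups of 10 snakes each,
# described by three toxin abundances (NOT the study data).
group_a <- cbind(rnorm(10, 40, 3), rnorm(10, 20, 3), rnorm(10, 1, 0.3))
group_b <- cbind(rnorm(10, 3, 1),  rnorm(10, 25, 3), rnorm(10, 50, 3))
composition <- rbind(group_a, group_b)
colnames(composition) <- c("SVMP", "PLA2", "3Ftx")

# PCA on centred and scaled abundances.
pca <- prcomp(composition, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component.
explained <- pca$sdev^2 / sum(pca$sdev^2)

# K-means on the scores of the first two principal components
# (assumption: two clusters, 25 random starts).
clusters <- kmeans(pca$x[, 1:2], centers = 2, nstart = 25)

# Cross-tabulate cluster assignment against the simulated group.
print(table(cluster = clusters$cluster,
            group   = rep(c("A", "B"), each = 10)))
```

Scaling before PCA keeps a single highly abundant toxin from dominating the components. Whether that is desirable for percentage data is a modelling choice.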

Prediction model based on venom composition

A simple vanilla ANN correctly classified the entire test set (25% of the data).

  • Specifications: 4 hidden neurons, learning rate = 0.001, number of epochs = 100, loss criterion = binary cross-entropy.
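
Holding out 25% of the data as a test set could look roughly like the base R sketch below. It covers only the split, not the network itself. The toy family labels and the stratified sampling within each family are assumptions, since the slides do not state how the split was drawn.

```r
set.seed(42)  # reproducible split

# Hypothetical family labels for 20 snakes (NOT the study data).
family <- rep(c("Viperidae", "Elapidae"), times = c(12, 8))

test_fraction <- 0.25

# Assumption: stratified sampling, so each family contributes the same
# fraction of its snakes to the test set.
idx_by_family <- split(seq_along(family), family)
test_idx <- unlist(lapply(idx_by_family, function(idx) {
  idx[sample.int(length(idx), size = round(length(idx) * test_fraction))]
}), use.names = FALSE)

# Everything not drawn for the test set is used for training.
train_idx <- setdiff(seq_along(family), test_idx)

print(table(family[test_idx]))
```

Changing `test_fraction` to 0.40 gives the larger test size used in the misclassification analysis.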

Theoretical analysis of incorrect labels

To investigate a case with misclassified snakes, a new model was trained with a test size of 40%. Five snakes were misclassified, as illustrated below:

Analysis of special cases

Snakes incorrectly labelled by the sub-optimal ANN:

## # A tibble: 5 x 2
##   Snake                    Family   
##   <chr>                    <chr>    
## 1 Daboia russelii russelii Viperidae
## 2 Hydrophis cyanocinctus   Elapidae 
## 3 Micropechis ikaheka      Elapidae 
## 4 Naja kaouthia            Elapidae 
## 5 Naja kaouthia            Elapidae

Snake from K-means cluster 2:

## # A tibble: 1 x 2
##   Snake             Family  
##   <chr>             <chr>   
## 1 Bungarus candidus Elapidae

Shiny app

Static plots for publication

Comparing venom composition between snake species

Comparing venom composition within species